Wikipedia-based Compact Hierarchical Semantics for Natural Language Processing

Author

  • SOFIA LIBERMAN
Abstract

A correct semantic representation of words and texts underlies many text processing tasks such as text categorization, word sense disambiguation, and semantic relatedness assessment. It has long been recognized that computers require access to common-sense and domain-specific world knowledge in order to process textual data at a deeper level. In this paper, we present a novel representation of semantics that is based on the structured encyclopedic knowledge encoded within Wikipedia articles and categories and the conceptual hierarchy inferred from this knowledge base. Our method, called Compact Hierarchical Explicit Semantic Analysis (CHESA), generates hierarchical semantic representations of unrestricted natural language texts. It represents semantics as a compact hierarchical structure of predefined natural concepts, capturing semantics at different abstraction levels and constructing representations at any given size, depending on the task at hand. In comparison to previous methods, CHESA generates very intuitive and comprehensible representations that allow deep semantic reasoning and understanding. We present a methodology to compute semantic relatedness using CHESA representations and evaluate CHESA on the task of semantic relatedness assessment of words and texts. Empirical results show that, for compact representations, CHESA is superior to the previous state of the art.

Technion, Computer Science Department, M.Sc. Thesis MSC-2010-02, 2010
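The abstract does not spell out how CHESA builds or compares its representations, so the following is only a minimal sketch of the general idea it describes: concept weights are propagated up a concept hierarchy, the result is truncated to a chosen size, and two such compact representations are compared. The toy hierarchy, the `decay` factor, the cosine comparison, and every function name here are assumptions made for illustration, not the thesis's actual algorithm.

```python
import math
from collections import defaultdict

# Hypothetical toy hierarchy (child concept -> parent category).
# A real system would derive this from Wikipedia articles and categories.
PARENT = {
    "Lion": "Felines",
    "Jaguar (animal)": "Felines",
    "Felines": "Mammals",
    "Dolphin": "Mammals",
    "Mammals": "Animals",
    "Jaguar Cars": "Car manufacturers",
    "Car manufacturers": "Vehicles",
}


def roll_up(leaf_weights, parent=PARENT, decay=0.5):
    """Propagate each concept's weight to its ancestors.

    The weight is multiplied by `decay` (an assumed damping factor)
    at every step up the hierarchy.
    """
    weights = defaultdict(float)
    for concept, w in leaf_weights.items():
        node = concept
        while node is not None:
            weights[node] += w
            w *= decay
            node = parent.get(node)
    return dict(weights)


def compact(weights, size):
    """Keep only the `size` highest-weighted concepts (ties broken by name)."""
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:size]
    return dict(top)


def cosine(a, b):
    """Cosine similarity between two sparse concept-weight dictionaries."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relatedness(leaf_a, leaf_b, size=4):
    """Compare two texts via their compact, hierarchy-aware representations."""
    rep_a = compact(roll_up(leaf_a), size)
    rep_b = compact(roll_up(leaf_b), size)
    return cosine(rep_a, rep_b)


if __name__ == "__main__":
    lion = {"Lion": 1.0}
    dolphin = {"Dolphin": 1.0}
    cars = {"Jaguar Cars": 1.0}
    print(relatedness(lion, dolphin))  # positive: shared ancestor categories
    print(relatedness(lion, cars))     # zero: no shared concepts in this toy data
```

In this toy setup, two inputs that share no leaf concept can still receive a nonzero score through shared ancestor categories, which is one plausible reading of "capturing semantics at different abstraction levels". The `size` parameter stands in for the abstract's "representations at any given size".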

Related resources

Wikipedia-based Compact Hierarchical Semantics with Application to Semantic Relatedness

A proper semantic representation of words and texts underlies many text processing tasks. In this paper, we present a novel representation of semantics which is based on a hierarchical ontology of natural concepts derived from Wikipedia articles and category system. Our method, called Compact Hierarchical Explicit Semantic Analysis (CHESA), generates compact hierarchical representations of unre...

Using Wikipedia for Hierarchical Finer Categorization

Wikipedia is one of the largest growing structured resources on the Web and can be used as a training corpus in natural language processing applications. In this work, we present a method to categorize named entities under the hierarchical fine-grained categories provided by the Wikipedia taxonomy. Such a categorization can be further used to extract semantic relations among these named entitie...

Wikipedia-based Semantic Interpretation for Natural Language Processing

Adequate representation of natural language semantics requires access to vast amounts of common sense and domain-specific world knowledge. Prior work in the field was based on purely statistical techniques that did not make use of background knowledge, on limited lexicographic knowledge bases such as WordNet, or on huge manual efforts such as the CYC project. Here we propose a novel method, cal...

Computing semantic relatedness of words and texts in Wikipedia-derived semantic space

Adequate representation of natural language semantics requires access to vast amounts of common sense and domain-specific world knowledge. Prior work in the field was either based on purely statistical techniques that did not make use of background knowledge or on huge manual efforts, such as the CYC projects. Here we propose a novel method, called Explicit Semantic Analysis (ESA), for finegrai...

How to Add a New Language on the NLP Map: Building Resources and Tools for Languages with Scarce Resources

Those of us whose mother tongue is not English, or who are curious about applications involving other languages, often find ourselves in the situation where the tools we require are not available. According to recent studies there are about 7200 different languages spoken worldwide – without including variations or dialects – out of which very few have automatic language processing tools and machine...

Publication date: 2010